Summary. Exponential-family random network (ERN) models specify a joint representation
of both the dyads of a network and nodal characteristics. This class of models allows the
nodal characteristics to be modelled as stochastic processes, expanding the range and realism
of exponential-family approaches to network modelling. In this paper we develop a theory
of inference for ERN models when only part of the network is observed, as well as specific
methodology for missing data, including non-ignorable mechanisms for network-based sampling
designs and for latent class models. In particular, we consider data collected via contact
tracing, which is of considerable importance to infectious disease epidemiology and public health.



1. Introduction 



It is not uncommon for researchers to collect data on a subset of a single network rather than
observing the full network. This partially observed case has been studied within the framework
of exponential-family random graph models (ERGM) by Handcock and Gile (2010);
however, their formulation suffers from the limitation that any nodal attributes included in
the model must be fully observed, and only dyads may be missing. This assumption is not
met in most sampling designs, where only some of the nodes are surveyed by the researcher,
and it reduces the practical usefulness of ERGMs in the missing data setting.

By including nodal attributes as variates rather than fixed quantities, exponential-family
random network models (ERNM; Fellows and Handcock 2012) can provide a convenient
basis for inference in cases where the data are partially unobserved, due either to design or
to out-of-design (e.g., non-response) mechanisms. While our framework is applicable to all
partial observation mechanisms, we consider three common mechanisms for partial observation
in more detail, specifically:



Missing Data: If the population comprises a large number of units, or the number
of edges is large, it is relatively common to find that the resources to observe a full
network are not available. Often units or dyads are unavailable for sampling or do
not provide complete responses to a survey instrument. In this case, only some of
the dyads and nodal characteristics are collected. We treat missing data as a form of
sampling in which the sampling mechanism is unknown and outside the control of the
researcher, that is, an out-of-design missing data mechanism. A good example of this is
the National Longitudinal Study of Adolescent Health (Add Health), a school-based,
longitudinal study of the health-related behaviours of adolescents and their outcomes
in young adulthood. The study design sampled 80 high schools and 52 middle schools
from the U.S., representative with respect to region of country, urbanicity, school size,
school type, and ethnicity (Harris et al. 2003). In 1994-95 an in-school questionnaire
was administered to a nationally representative sample of students in grades 7 through
12. In addition to demographic and contextual information, each respondent was
asked to nominate up to five boys and five girls within the school whom they regarded
as their best friends. Thus each student could nominate up to ten students within the
school (Udry 2003). The nominations and contextual information were not available
for some of the adolescents, due either to absence from school while the survey was
being conducted or to refusal to participate. Thus, both the graph and nodal variates
contained missing values.

Network sampling designs: Many studies in hard-to-reach populations use study designs
that trace the linkages of an underlying social network. In these designs, the network
is partially observed; however, it is not of primary interest to the researcher. Such
sampling designs have been exploited to estimate population disease rates (Gile and
Handcock 2010; Gile 2011; Gile and Handcock 2011).



Latent variables: Some quantities of the network may be in principle unobservable. The
probability model for a network may posit the existence of unknown variables which
do not correspond to any observable quantity. For example, stochastic block models
(Nowicki and Snijders 2001) posit the existence of classes of nodes, conditional upon
which the dyads are independent. These classes are unobservable nodal characteristics
and must be inferred from the relational data. Similarly, latent position cluster
models (Handcock et al. 2006) posit the existence of unobservable continuous nodal
quantities that provide a spatial geometry for the network structure.

In this paper we develop approaches for each of these scenarios in the context of ERNMs.
Sections 2 through 4 introduce ERNM and extend the theory to incorporate partially
observed populations. Section 5 develops methodology for each of the scenarios. Sub-section
5.1 looks at the effect of random non-response, and sub-section 5.2 applies a latent class
model to extract unknown clusters from a real data-set. Sub-section 5.3 develops estimates
based on contact tracing designs, which are of vital importance to the public health community.
To our knowledge, the methods outlined in this paper represent the first statistically
justifiable approach to inference from contact tracing data.



2. Exponential-family random network models 



Exponential-family random network models (Fellows and Handcock 2012) are a generalisation
of the exponential-family random graph model (Frank and Strauss 1986; Hunter
and Handcock 2006), where both dyads and nodal characteristics are treated as random
variates. Formally, in a population of $n$ units, let $Y_{ij}$ indicate that unit $i$ has a tie to unit
$j$. Let $Y$ be an $n \times n$ matrix $[Y_{ij}]$ and $X$ be an $n \times K$ matrix $[X_{ik}]$ of unit covariates.
We define a network $T$ as the union of the nodal covariates and the graph structure (i.e.
$T = \{X, Y\}$). An exponential family model of $T$ is expressed as



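$$
p(T = t \mid \eta) = \frac{1}{c(\eta, \mathcal{T})}\, e^{\eta \cdot g(t)}, \qquad t \in \mathcal{T}, \tag{1}
$$

where $\eta \in \mathbb{R}^q$ is a vector of parameters, $g$ is a $q$-vector valued function defining a set
of sufficient statistics, $\mathcal{T}$ is the sample space of networks and
$c(\eta, \mathcal{T}) = \sum_{t \in \mathcal{T}} e^{\eta \cdot g(t)}$ is the
normalising constant. This model is developed in Fellows and Handcock (2012).
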
2.1. The Simple Homophily Model

Though any set of network statistics can be represented by $g$ in equation (1), the examples in
this paper will focus on a particularly parsimonious, but powerful, network model. Suppose
that $X = (X_1, \dots, X_n)$ is a univariate categorical variable with $m$ levels, labelled $0, \dots, m-1$.
If $X_i = l$ we say that unit $i$ is in group $l$. A joint model for $X$ and $Y$ is

$$
p(X = x, Y = y \mid \eta) = \frac{1}{c(\eta, \mathcal{T})}
\exp\Big( \eta_1\, \mathrm{edges}(y) + \eta_2\, h(y, x) + \sum_{k=0}^{m-2} \eta_{k+3}\, n_k(x) \Big).
$$

The first term of this model is the number of edges, and controls the density of the graph.
The last term represents the number of nodes in each category of $x$, except for the last
level, which is dropped to maintain identifiability of the model. The second term $h$ is the
regularised sample homophily of $x$, as introduced by Fellows and Handcock (2012), and is
defined as

$$
h(y, x) = \sum_{k=0}^{m-1} \sum_{i : x_i = k}
\Big( \sqrt{d_{i,k}(y, x)} - E_{\perp}\big( \sqrt{d_{i,k}(Y, X)} \big) \Big),
$$

where $d_{i,k}(y, x)$ is the number of edges between node $i$ and nodes in group $k$, and $E_{\perp}(f(Y, X))$
is the expectation of the statistic $f(Y, X)$, conditional upon $Y = y$ and the category counts
(that is, the number of nodes in each category of $x$, $n(x) = \{n_k(x)\}_{k=0}^{m-1}$), assuming that
$X$ and $Y$ are independent. Thus, each term in the sum is the square root of the number
of neighbours of a node which share the same category, minus what would be expected
by chance. Using this form of homophily avoids the degeneracy problems found in other
formulations. For a more thorough justification, see Fellows and Handcock (2012).
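
As a rough illustration of the homophily statistic, the sketch below computes it for a small undirected graph. It rests on assumptions that are not taken from the text: the chance expectation is evaluated conditional on each node's own category and degree, so that the number of same-category neighbours follows a hypergeometric distribution, and the function names `expected_sqrt_same_group` and `homophily` are hypothetical.

```python
import math
from typing import Sequence


def expected_sqrt_same_group(degree: int, n: int, n_k: int) -> float:
    """E[sqrt(D)] where D is hypergeometric: of `degree` neighbours taken from the
    other n - 1 nodes, D counts those among the n_k - 1 other group members.
    (Assumed form of the chance expectation; see lead-in.)"""
    others, same = n - 1, n_k - 1
    total = math.comb(others, degree)
    expectation = 0.0
    for j in range(min(degree, same) + 1):
        pmf = math.comb(same, j) * math.comb(others - same, degree - j) / total
        expectation += math.sqrt(j) * pmf
    return expectation


def homophily(adj: Sequence[Sequence[int]], x: Sequence[int]) -> float:
    """Sum over nodes of sqrt(# same-category neighbours) minus its chance expectation."""
    n = len(x)
    counts = {}
    for label in x:
        counts[label] = counts.get(label, 0) + 1
    h = 0.0
    for i in range(n):
        nbrs = [j for j in range(n) if j != i and adj[i][j]]
        d_same = sum(1 for j in nbrs if x[j] == x[i])
        h += math.sqrt(d_same) - expected_sqrt_same_group(len(nbrs), n, counts[x[i]])
    return h


# Two disjoint triangles: nodes 0-2 and nodes 3-5.
adj = [[0] * 6 for _ in range(6)]
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    adj[a][b] = adj[b][a] = 1

h_clustered = homophily(adj, [0, 0, 0, 1, 1, 1])  # categories match the triangles
h_single = homophily(adj, [0] * 6)                # one category: no homophily signal
```

When the categories coincide with the two triangles the statistic is positive, and with a single category it is zero, since every neighbour is then necessarily in the same group.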



While the examples in this paper focus on applications of the simple homophily model,
the framework presented here applies to any arbitrary set of network statistics $g$. For
example, in many applications the nodal attributes are multivariate, and their relationships
are of interest to the researcher. Fellows and Handcock (2012) developed a network statistic
that can be interpreted as a conditional logistic regression term which, if included, can model
the relationship of several categorical variates.



3. Likelihood-based Inference from Partially Observed Networks 

In this section we develop likelihood-based inference for network models based on partial 
observation of the networks. The approach allows non-ignorable sampling mechanisms for 
the networks, including some common network-based sampling designs. 



Handcock and Gile (2010) developed a theory of missing data for ERG models, and the
specification for ERN models proceeds similarly, though our formulation supports a more
general class of missingness processes known as missing not at random (MNAR; see Rubin
1976). Let $T_{obs}$ and $T_{miss}$ represent, respectively, the observed and unobserved parts of the
complete network $T$. We write $T = (T_{obs}, T_{miss})$, with realisations $t = (t_{obs}, t_{miss})$. Let $W$
be a random variable representing the sampling process, with realisation $w$. The probability
distribution of $W$ is the sampling mechanism, and must fully specify the sample selection
process, including the partition of $T$ into $T_{obs}$ and $T_{miss}$. Typically, $W$ will consist of an
$n$ by $n$ matrix indicating whether each dyad was sampled, and an $n$ by $K$ matrix indicating
which nodal attributes are missing; however, $W$ may contain additional information about
the sampling, such as the order of sampling.
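We write the full data likelihood as

$$
p(T = t, W = w \mid \eta, \theta) = p(W = w \mid T = t, \theta)\, \frac{1}{c(\eta, \mathcal{T})}\, e^{\eta \cdot g(t)},
$$

and we wish to draw inferences about $\eta$ from the observed data likelihood, defined as

$$
p(T_{obs} = t_{obs}, W = w \mid \eta, \theta)
= \sum_{t_{miss}} p(W = w \mid T = (t_{obs}, t_{miss}), \theta)\, \frac{1}{c(\eta, \mathcal{T})}\, e^{\eta \cdot g((t_{obs}, t_{miss}))}. \tag{2}
$$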



This probability model jointly represents the distribution of the network $T$ and the
sampling process $W$. The functional form of $p(W = w \mid T = t, \theta)$ is dependent on the form
of missingness, and will differ depending on how $T_{obs}$ was obtained. Section 5.3 illustrates
a design of particular interest known as biased seed link tracing. When the sampling
probabilities depend only on the observed data, the sampling design is amenable to
the model (Handcock and Gile 2010), and is ignorable in the sense of Rubin (1976). In this
case, the likelihood simplifies to



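$$
\begin{aligned}
p(T_{obs} = t_{obs}, W = w \mid \eta, \theta)
&= \sum_{t_{miss}} p(W = w \mid T_{obs} = t_{obs}, \theta)\, \frac{1}{c(\eta, \mathcal{T})}\, e^{\eta \cdot g((t_{obs}, t_{miss}))} \\
&= p(W = w \mid T_{obs} = t_{obs}, \theta) \sum_{t_{miss}} \frac{1}{c(\eta, \mathcal{T})}\, e^{\eta \cdot g((t_{obs}, t_{miss}))} \\
&\propto \sum_{t_{miss}} \frac{1}{c(\eta, \mathcal{T})}\, e^{\eta \cdot g((t_{obs}, t_{miss}))}.
\end{aligned} \tag{3}
$$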



Thus, when the sampling process is ignorable, inferences on $\eta$ are not affected by
$p(W = w \mid T_{obs} = t_{obs}, \theta)$, and so knowledge of the sampling process is not essential for the process
of inference.
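
To make the ignorable case concrete, the following sketch evaluates the sum over the missing part of the network by brute-force enumeration. It is a toy under stated assumptions rather than the authors' implementation: the network has no nodal attributes, the only statistic is the edge count, dyads are indexed 0, 1, 2, ..., and the function name `ignorable_observed_likelihood` is hypothetical.

```python
import itertools
import math


def ignorable_observed_likelihood(eta: float, n_dyads: int, observed: dict) -> float:
    """Sum of e^{eta * edges(t)} / c over every completion of the missing dyads.

    `observed` maps a dyad index to its observed value (0 or 1); dyads that are
    not keys of `observed` are treated as missing and summed out.
    """
    # Normalising constant: enumerate the whole sample space of 2^n_dyads graphs.
    c = sum(math.exp(eta * sum(t))
            for t in itertools.product((0, 1), repeat=n_dyads))
    n_missing = n_dyads - len(observed)
    observed_edges = sum(observed.values())
    numerator = sum(math.exp(eta * (observed_edges + sum(t_miss)))
                    for t_miss in itertools.product((0, 1), repeat=n_missing))
    return numerator / c


eta = 0.5
# Three dyads: dyad 0 observed as an edge, dyad 1 observed as a non-edge, dyad 2 missing.
lik = ignorable_observed_likelihood(eta, 3, {0: 1, 1: 0})
```

For this edge-only model the dyads are independent, so the result equals p(1 - p) with p = e^eta / (1 + e^eta), which gives a simple check on the enumeration. Enumeration is only feasible for very small networks; larger problems need the MCMC approach of Section 4.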

Having defined the full and observed likelihood, it is also useful to define the missing 
data likelihood: 



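$$
p(T_{miss} = t_{miss} \mid W = w, T_{obs} = t_{obs}, \eta, \theta)
= \frac{p(W = w \mid T = (t_{obs}, t_{miss}), \theta)\, e^{\eta \cdot g((t_{obs}, t_{miss}))}}{c(t_{obs}, w, \eta, \theta)},
$$

where

$$
c(t_{obs}, w, \eta, \theta) = \sum_{t_{miss}} p(W = w \mid T = (t_{obs}, t_{miss}), \theta)\, e^{\eta \cdot g((t_{obs}, t_{miss}))}.
$$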



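The (observed data) likelihood can then be rewritten as the ratio of two normalising
constants

$$
p(T_{obs} = t_{obs}, W = w \mid \eta, \theta)
= \sum_{t_{miss}} p(W = w \mid T = (t_{obs}, t_{miss}), \theta)\, \frac{1}{c(\eta, \mathcal{T})}\, e^{\eta \cdot g((t_{obs}, t_{miss}))}
= \frac{c(t_{obs}, w, \eta, \theta)}{c(\eta, \mathcal{T})},
$$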

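and using this, we may write the observed data log likelihood ratio of $(\eta, \theta)$ versus $(\eta_0, \theta_0)$
as

$$
\begin{aligned}
\ell(\eta, \theta) - \ell(\eta_0, \theta_0)
&= \log \frac{c(t_{obs}, w, \eta, \theta)}{c(t_{obs}, w, \eta_0, \theta_0)} - \log \frac{c(\eta, \mathcal{T})}{c(\eta_0, \mathcal{T})} \\
&= \log E_{\eta_0, \theta_0}\Big( \frac{p(W = w \mid T, \theta)}{p(W = w \mid T, \theta_0)}\, e^{(\eta - \eta_0) \cdot g(T)} \,\Big|\, T_{obs} = t_{obs}, W = w \Big)
 - \log E_{\eta_0}\big( e^{(\eta - \eta_0) \cdot g(T)} \big) && (4) \\
&= \log \frac{E_{\eta_0}\big( e^{(\eta - \eta_0) \cdot g(T)}\, p(W = w \mid T, \theta) \,\big|\, T_{obs} = t_{obs} \big)}
             {E_{\eta_0}\big( p(W = w \mid T, \theta_0) \,\big|\, T_{obs} = t_{obs} \big)}
 - \log E_{\eta_0}\big( e^{(\eta - \eta_0) \cdot g(T)} \big). && (5)
\end{aligned}
$$

Both equation (4) and equation (5) motivate algorithms to draw inferences about $\eta$ and
$\theta$. Section 4 describes the algorithm motivated by equation (4), and Appendix A.1 outlines
an algorithm using equation (5).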
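
The step from the ratio of normalising constants to the conditional expectation in equation (4) can be spelled out using only the definitions of the missing data likelihood and of $c(t_{obs}, w, \eta, \theta)$. The sketch below writes $t = (t_{obs}, t_{miss})$ and assumes $p(W = w \mid T = t, \theta_0) > 0$ for every completion $t_{miss}$.

```latex
\begin{align*}
\frac{c(t_{obs}, w, \eta, \theta)}{c(t_{obs}, w, \eta_0, \theta_0)}
&= \sum_{t_{miss}} \frac{p(W = w \mid T = t, \theta)\, e^{\eta \cdot g(t)}}{c(t_{obs}, w, \eta_0, \theta_0)} \\
&= \sum_{t_{miss}} \frac{p(W = w \mid T = t, \theta)}{p(W = w \mid T = t, \theta_0)}\,
   e^{(\eta - \eta_0) \cdot g(t)}\;
   \frac{p(W = w \mid T = t, \theta_0)\, e^{\eta_0 \cdot g(t)}}{c(t_{obs}, w, \eta_0, \theta_0)} \\
&= E_{\eta_0, \theta_0}\!\left( \frac{p(W = w \mid T, \theta)}{p(W = w \mid T, \theta_0)}\,
   e^{(\eta - \eta_0) \cdot g(T)} \,\middle|\, T_{obs} = t_{obs},\, W = w \right).
\end{align*}
```

The last factor in the second line is the missing data likelihood at $(\eta_0, \theta_0)$, which is why the sum becomes an expectation over $T_{miss}$ given the observed data. The same argument with the sum taken over all of $\mathcal{T}$, and no sampling term, gives $c(\eta, \mathcal{T}) / c(\eta_0, \mathcal{T}) = E_{\eta_0}(e^{(\eta - \eta_0) \cdot g(T)})$.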

4. Calculating the MLE with MCMC 

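For most models, equation (4) is not analytically solvable. However, we may approximate it
by Markov chain Monte Carlo (MCMC). Let $t^{(i)}$ and $t^{(i)}_{miss}$, where $i \in \{1, \dots, M\}$, be samples
from the full likelihood and missing data likelihood respectively, with parameters $\eta_0, \theta_0$.
Then equation (4) may be approximated by

$$
\ell(\eta, \theta) - \ell(\eta_0, \theta_0) \approx
\log\Big( \frac{1}{M} \sum_{i=1}^{M} \frac{p(W = w \mid T = (t_{obs}, t^{(i)}_{miss}), \theta)}{p(W = w \mid T = (t_{obs}, t^{(i)}_{miss}), \theta_0)}\, e^{(\eta - \eta_0) \cdot g((t_{obs}, t^{(i)}_{miss}))} \Big)
- \log\Big( \frac{1}{M} \sum_{i=1}^{M} e^{(\eta - \eta_0) \cdot g(t^{(i)})} \Big). \tag{6}
$$

As $\eta, \theta$ move away from $\eta_0, \theta_0$ the quality of this approximation degrades. Because we
will be optimising equation (4), it is useful to have both the first and second derivatives of
the log likelihood, which are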

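$$
\frac{\partial \ell}{\partial \eta_i} = E_{\eta, \theta}\big( g_i(T) \mid T_{obs} = t_{obs}, W = w \big) - E_{\eta, \theta}\big( g_i(T) \big),
$$

$$
\frac{\partial^2 \ell}{\partial \eta_i\, \partial \eta_j} = -\mathrm{cov}\big( g_i(T), g_j(T) \big) + \mathrm{cov}\big( g_i(T), g_j(T) \mid T_{obs} = t_{obs}, W = w \big).
$$

The expectations and covariances in these derivatives can be approximated using the
conditional and unconditional MCMC samples, and thus we can use the following algorithm
to approximate the MLE.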

(a) Let $k = 0$ and choose initial parameter values $\eta^{(0)}, \theta^{(0)}$.

(b) Use MCMC to generate $M$ samples $t^{(i)}_{miss}$ from $P(T_{miss} = t_{miss} \mid \eta^{(k)}, \theta^{(k)}, T_{obs} = t_{obs}, W = w)$.

(c) Use MCMC to generate $M$ samples $t^{(i)}$ from $P(T = t \mid \eta^{(k)})$.

(d) Using the samples from steps (b) and (c) in equation (6), find $\eta^{(k+1)}, \theta^{(k+1)}$ maximising the
likelihood ratio, subject to $\|\eta^{(k+1)} - \eta^{(k)}\| < \epsilon$ and $\|\theta^{(k+1)} - \theta^{(k)}\| < \epsilon$.

(e) If the likelihood has not converged, set $k = k + 1$ and go to step (b).

(f) Let the MLE be $\hat{\eta} = \eta^{(k+1)}$ and $\hat{\theta} = \theta^{(k+1)}$.
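
The Monte Carlo approximation that step (d) maximises can be sketched as below. This is a minimal illustration, not the authors' code: it assumes the sufficient statistics of the conditional and unconditional samples have already been computed, that the log ratio of sampling probabilities is supplied per conditional sample (all zeros in the ignorable case), and the names `log_mean_exp` and `approx_log_lik_ratio` are hypothetical.

```python
import math
from typing import Optional, Sequence


def log_mean_exp(values: Sequence[float]) -> float:
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def approx_log_lik_ratio(eta: Sequence[float],
                         eta0: Sequence[float],
                         g_cond: Sequence[Sequence[float]],
                         g_full: Sequence[Sequence[float]],
                         log_w_ratio: Optional[Sequence[float]] = None) -> float:
    """Approximate l(eta, theta) - l(eta0, theta0) from two sets of MCMC samples.

    g_cond:      statistics g((t_obs, t_miss_i)) of samples from the missing data likelihood
    g_full:      statistics g(t_i) of samples from the unconditional model
    log_w_ratio: log p(w | t_i, theta) - log p(w | t_i, theta0) for each conditional
                 sample; None is treated as all zeros (ignorable design).
    """
    delta = [a - b for a, b in zip(eta, eta0)]
    if log_w_ratio is None:
        log_w_ratio = [0.0] * len(g_cond)
    cond_terms = [lw + sum(d * s for d, s in zip(delta, g))
                  for lw, g in zip(log_w_ratio, g_cond)]
    full_terms = [sum(d * s for d, s in zip(delta, g)) for g in g_full]
    return log_mean_exp(cond_terms) - log_mean_exp(full_terms)


# At eta == eta0 with an ignorable design the ratio is exactly zero.
at_reference = approx_log_lik_ratio([1.0], [1.0], [[3.0], [5.0]], [[2.0], [4.0]])
# One statistic, constant within each sample set, eta - eta0 = 2.
shifted = approx_log_lik_ratio([2.0], [0.0], [[1.0], [1.0]], [[0.0], [0.0]])
```

Working on the log scale with `log_mean_exp` avoids overflow when the exponent is large, which is a design choice of this sketch rather than something stated in the text.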



Asymptotic standard errors for $\hat{\eta}$ may be obtained using an MCMC approximation to the
Fisher information (i.e. the second derivative of the log likelihood). While asymptotics of
the Fisher information are not assured with respect to ERNM (or ERGM) models, Fellows
and Handcock (2012) show strong empirical agreement between the Fisher information
standard errors and parametric bootstrap simulations. Standard errors for the mean value
parameters $\hat{\mu} = E(g(T) \mid \eta = \hat{\eta})$ can be approximated by MCMC sampling.



5. Specific forms of partial observation 

In this section we consider the three common forms of partial observation considered in 
the introduction, each corresponding to a different mechanism of partial observation or 
conceptualisation of that mechanism. 



5.1. Missing Data: Unobserved Relational Information

It is common when surveying networked populations that there are insufficient resources to
conduct a census of the population and their relations. For efficiency reasons, a sampling
based survey is undertaken, or the full network is partially observed due to non-response.
In this sub-section, we give an illustration of the effect of non-response where the dyad
information is missing completely at random. We consider the relations of "liking" among
18 monks in a monastery (Sampson 1969). The network analysed has a directed edge
between two monks if the sender monk ranked the receiver monk in the top three monks
for positive affection in any of the three interviews given over a twelve month period (Hoff
et al. 2002). The sociogram of this data-set is shown in Figure 1. One nodal attribute of
interest is an indicator of attendance at the minor "Cloisterville" seminary before coming
to the monastery.

We fit a simple homophily model on Cloisterville status using the full data. We then ran
simulations on the effect of missingness by selecting dyads, and Cloisterville status variates,
completely at random and setting them to missing. Figure 2 shows one simulated missingness
pattern with 15% missing. We ran 100 simulations at each missingness percentage.
Means and standard deviations of the ERNM models fit to these simulated missingness
patterns are displayed in Figure 3.
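
One way such a missingness pattern could be simulated is sketched below. The details are assumptions, not taken from the text: a fixed fraction of directed dyads and of nodal values is drawn without replacement, and the function name `mcar_missingness` is hypothetical.

```python
import random
from typing import Set, Tuple


def mcar_missingness(n: int, prop: float, seed: int = 0) -> Tuple[Set[Tuple[int, int]], Set[int]]:
    """Choose dyads and nodal attribute values to set to missing, completely at random.

    Returns the set of missing directed dyads (i, j) with i != j, and the set of
    nodes whose attribute is missing. The fraction `prop` is applied to each
    separately (an assumption of this sketch).
    """
    rng = random.Random(seed)
    dyads = [(i, j) for i in range(n) for j in range(n) if i != j]
    missing_dyads = set(rng.sample(dyads, round(prop * len(dyads))))
    missing_nodes = set(rng.sample(range(n), round(prop * n)))
    return missing_dyads, missing_nodes


# 18 monks, 15% missingness: 306 directed dyads and 18 attribute values to draw from.
missing_dyads, missing_nodes = mcar_missingness(18, 0.15, seed=1)
```

Because selection does not depend on any dyad or attribute value, a mask generated this way is ignorable, so the simplified likelihood of Section 3 applies when refitting the model.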

We see that the standard deviations of the estimates increase as the amount of missingness
increases. At the higher missingness levels some bias is apparent relative to the
full data MLE, but not more than one standard deviation. One possible explanation for
this bias is that there were only six monks who attended Cloisterville, and so at 50%
missingness, a significant number of samples will include no (or perhaps a single) Cloisterville
monks.

Fig. 1. Relationships among monks within a monastery and their affiliations as identified by Sampson:
Young (T)urks, (L)oyal Opposition, and (O)utcasts.

Fig. 2. Sampson's monks with 15% missingness. Cloisterville status marked on the right hand side.

Fig. 3. Means and standard deviations of model estimates. Red lines indicate fully observed MLE.



5.2. Latent Variables: Stochastic Block Models 

In this sub-section we consider the situation where some characteristics of the network are
posited but unobserved. Specifically, we consider the case where each node of the network
belongs to a latent class, and the structure of the network depends on that latent class. The
traditional approach to this has been stochastic block models (Nowicki and Snijders 2001),
and here we show how these models fall naturally out of our general formulation.

It is apparent from Figure 1 that the pattern of "liking" between the monks may exhibit
clustering. Through close sociological study, Sampson (1969) identified three clusters which
he dubbed the Turks, Loyal Opposition and the Outcasts (see Figure 1). Here we will
attempt to identify clusters by inferring class membership from the graph. We fit the
simple homophily model of Section 2.1 to this data, assuming a class covariate, $X$, with
three levels, and that all of the monks are "missing" their class covariate. The simple
homophily model treated this way represents a novel latent block model in the spirit of
Nowicki and Snijders (2001). Note that the missingness process here is ignorable because it
does not depend on unobserved quantities, as all of the $x$ values are missing regardless of the
$Y$ values. We fit the model using the algorithm in Section 4. Table 1 shows the maximum
likelihood parameter estimates, along with standard errors of the estimators based on the
Fisher information.

The natural parameter estimates indicate significant homophily in tie formation based
on the class. They also indicate that the number of monks in the third class is significantly
larger than in the other two classes, which are not statistically significantly different
in size. The mean value parameters indicate that the expected number of ties is about 88,
and the expected numbers in the three groups are 4, 7 and 7.

Table 1. Latent class model for Sampson's monks.

Term            $\hat{\eta}$    $\hat{\mu}$    s.e.($\hat{\eta}$)    s.e.($\hat{\mu}$)
# of edges      -0.58           88.23          0.14                  7.48
Homophily        7.28           15.30          0.91                  1.33
# in group 0    -2.50            3.95          1.44                  1.08
# in group 1    -0.02            6.95          1.31                  0.99

An advantage of this approach is that we can investigate the probability of class
membership, which is well defined through our framework as $p(X = x \mid Y = y_{obs}, \hat{\eta})$. To compute
$p(X = x \mid Y = y_{obs}, \hat{\eta})$ we simulated a large number of samples from this distribution
using MCMC, which showed the probability of the monks being in the classes displayed in Figure
1 to be above 0.9999. These clusters were also identical to those chosen by Sampson (1969)
and verified by later research (Breiger et al. 1975; Handcock et al. 2006).

In addition to assuming a set number of latent classes for the model, we can also use
the MLE procedure to select an appropriate number of clusters for the data. We fit the
simple homophily model with a latent variable $X$ able to take a potentially large number
of values (e.g., the number of monks). In this case $p(X = x \mid Y = y_{obs}, \hat{\eta})$ places zero mass
on all but three of the groups. This is evidence that the three groups we have identified
are a good classification for these data. More sophisticated model selection approaches for
choosing the number of clusters are possible (Handcock et al. 2006), and are left for future
work.

Our form of the stochastic block model is conceptually very clean, with the ability to
naturally incorporate additional covariates, multiple membership variables, and extensions
to an unbounded number of classes. Inference is straightforward, and quantities such as
the probability of class membership are well defined and interpretable. We leave a full
exploration of these for later work.



5.3. Network Sampling: Biased Seed Link-Tracing 

Handcock and Gile (2010) explored the idea of sampling networks by tracing the edges. As 
a general concept, link tracing involves selecting one or more seed nodes, and then observing 
the edges connected to those seeds. One or more of these edges are then followed to the 
neighbouring node, whose ties are observed, and the process is continued. Each iteration of 
this process is known as a wave. 

Provided that the seed nodes are chosen at random, and the method by which edges
are chosen to be followed depends only on the observed data, this missingness process is
ignorable. To be explicit, consider a link tracing process with $k$ waves. Let $w_i$ be the
ordered set of nodes and edges sampled in the $i$th wave, in the order in which they were
sampled, $w = \{w_0, \dots, w_k\}$, and $w_{-i} = \{w_0, \dots, w_{i-1}, w_{i+1}, \dots, w_k\}$. If the seeds are chosen
at random, and the edges followed by the sampling process are also chosen at random, then
$p(W = w \mid T = t, \theta) = p(W = w \mid T_{obs} = t_{obs}, \theta)$, implying that the missingness is ignorable.

In many cases, however, the seeds are not chosen at random from the population, but
are some form of convenience sample. For example, in a population where some people have
an infection and others do not, we may start with a sample of $s_1$ seeds picked at random
from among the infected individuals, and $s_{-1}$ seeds picked from the non-infected individuals.
These seeds are then used as a starting point for standard link tracing. We may then write
the sampling probability as

$$
\begin{aligned}
p(w \mid t, \theta) &= p(w_0 \mid t, \theta)\, p(w_{-0} \mid t_{obs}, w_0, \theta) \\
&= \frac{(n_1 - s_1)!}{n_1!}\, \frac{(n_{-1} - s_{-1})!}{n_{-1}!}\, p(w_{-0} \mid t_{obs}, w_0, \theta),
\end{aligned}
$$

where $n_1$ and $n_{-1}$ are the number of infected and non-infected in the population, respectively.
Note that $p(w_{-0} \mid t_{obs}, w_0, \theta)$ does not depend on $t_{miss}$ and may be factored out of the
likelihood in equation (2). Thus there is no need to calculate $p(w_{-0} \mid t_{obs}, w_0, \theta)$ explicitly,
as it makes no impact on the likelihood. Hence, in this case, we can compute the likelihood
without knowing the specific mechanism of seed selection.
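
The part of the sampling probability that does depend on unobserved quantities is the seed term, through the population counts of infected and non-infected nodes. A minimal sketch of its logarithm follows, assuming the seeds within each infection group are drawn in order, uniformly and without replacement; the function name `log_seed_prob` and its argument names are hypothetical.

```python
import math


def log_seed_prob(n_inf: int, n_non: int, s_inf: int, s_non: int) -> float:
    """log of (n_inf - s_inf)!/n_inf! * (n_non - s_non)!/n_non!.

    n_inf, n_non: numbers of infected and non-infected nodes in a candidate network t
    s_inf, s_non: numbers of infected and non-infected seeds
    Uses lgamma(k + 1) = log(k!) to stay stable for large populations.
    """
    return (math.lgamma(n_inf - s_inf + 1) - math.lgamma(n_inf + 1)
            + math.lgamma(n_non - s_non + 1) - math.lgamma(n_non + 1))


# The value changes with the (partly unobserved) number of infected nodes,
# which is why this term cannot be dropped from the likelihood.
lp_150 = log_seed_prob(150, 850, 40, 0)
lp_151 = log_seed_prob(151, 849, 40, 0)
```

In an MCMC sampler over the missing data, a quantity like this would be re-evaluated whenever a proposal changes an unobserved node's infection status.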



5.4. Network Sampling: Positive Contact Tracing 

As emerging epidemics develop, control measures (e.g., treatment, isolation and culling)
focus on those members of the population that are known to have the infection. Because
there are often many infected people who are unobserved, control can be ineffective (e.g.,
HIV; Potterat et al. 1989). The alternative of applying control measures to the entire
population can be economically infeasible or ineffective (e.g., some instances of safe sex
education) (Potterat et al. 1989; Klinkenberg et al. 2006). Contact tracing is the hybrid
approach of treating both the known infected individuals and those who may have been
infected by them (Potterat et al. 1989; Klinkenberg et al. 2006). In U.S. public health,
health clinics are required by state law to notify those at risk from infection due to their
sexual relations with individuals tested, and found to be infected, by the clinic. The process
of locating, notifying and then testing partners that may have been exposed to an infectious
agent allows additional information about the partners to be collected. While the
primary purpose of contact tracing is disease control via partner notification and partner
services, it is also a form of data collection that is rarely utilised. Such approaches are used
most commonly for syphilis and HIV/AIDS, but also for other STIs such as gonorrhea and
chlamydia (Golden et al. 2004), as well as routinely for tuberculosis and infectious disease
outbreaks. Contact tracing has also been applied in many recent epidemics (Fenner et al.
1988; Ferguson et al. 2001; Donnelly et al. 2003). In positive contact tracing, we follow all
edges from infected nodes, but edges from uninfected nodes are not followed.

While the process varies from state to state and also by disease, we consider the following
biased seed link tracing process:

(a) Select $s_{-1}$ seed subjects at random from among the non-infected population, and observe
them.

(b) Select $s_1$ seed subjects at random from among the infected population, and observe them.

(c) Choose the next infected seed at random.

(d) Observe all edges from the selected subject, and the infection status of these subjects.

(e) For all infected neighbours of the selected subject, go to step (d).

(f) If not all of the seeds have been chain sampled, go to step (c).
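
The tracing steps above can be sketched as a graph traversal. This is an illustration under assumptions, not the authors' simulation code: the network is an undirected adjacency mapping, the seeds are passed in rather than drawn inside the function, and the name `positive_contact_trace` is hypothetical. A queue replaces the recursion in step (e), which changes the order of tracing but not the final sets.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def positive_contact_trace(adj: Dict[int, Set[int]],
                           infected: Set[int],
                           infected_seeds: Iterable[int],
                           noninfected_seeds: Iterable[int]) -> Tuple[Set[int], Set[int]]:
    """Return (observed, traced).

    observed: nodes whose infection status is seen (seeds and contacts of traced nodes)
    traced:   infected nodes whose edges were all observed and followed
    """
    observed = set(noninfected_seeds) | set(infected_seeds)
    traced: Set[int] = set()
    queue = deque(infected_seeds)
    while queue:
        node = queue.popleft()
        if node in traced:
            continue
        traced.add(node)
        for nbr in adj[node]:
            observed.add(nbr)  # the contact's infection status is recorded
            if nbr in infected and nbr not in traced:
                queue.append(nbr)  # only infected contacts are traced further
    return observed, traced


# Chain 0 - 1 - 2 - 3 plus an isolated node 4; nodes 0, 1 and 3 are infected.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: set()}
observed, traced = positive_contact_trace(adj, {0, 1, 3}, [0], [4])
```

In the example, tracing stops at the uninfected node 2, so the infected node 3 is never observed. This dependence of what is observed on unobserved infection statuses is what makes the design informative.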

We simulated a networked population of $n = 1000$ people from the simple homophily
model of Section 2.1 with natural parameters of $\eta = (-5.8, 0.7, -1.95)$. The number of
infected nodes was fixed at 150. The generated network had a mean degree of 3.1, and its
degree distribution is displayed in Figure 4. There were 296 infected to non-infected ties,
with the mixing distribution displayed in Figure 5 indicating moderate homophily.

Fig. 4. Degree distribution of the networked population.

Fig. 5. Mixing statistics: Counts of the numbers of edges by the infection status of the incident nodes.

Fig. 6. Sizes of the contact-traced samples based on 40 seed subjects ($s_1 = 40$, $s_{-1} = 0$).

Starting with $s_1 = 40$ infected seeds, we simulated 100 positive link tracing samples for
each of $s_{-1} = (0, 45, 90, 135, 180, 225)$. Figure 6 displays a histogram of the sizes of the
samples when there are no non-infected seeds (i.e., $s_{-1} = 0$).

To provide a comparison for our method we considered two estimators that could be
utilised. Neither uses a model for the networked population; both are motivated by
approximations to the sampling design. The first treats the sample as a simple random
sample

$$
\text{Naive} = n\, \frac{n_i}{n_i + n_u},
$$

where $n_i$ and $n_u$ are the number of infected and uninfected in the sample respectively. The
second adjusts for the sampling of the seeds

$$
\text{Naive (seed adj.)} = (n - s_1 - s_{-1})\, \frac{n_i - s_1}{n_i - s_1 + n_u - s_{-1}} + s_1.
$$
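
The two comparison estimators of the number of infected in the population can be sketched as follows. The forms used are assumptions consistent with the description: the naive estimator scales the sample proportion infected up to the population size, and the seed-adjusted version removes the seeds before forming the proportion and then adds the infected seeds back. The function and argument names are hypothetical.

```python
def naive_estimate(n: int, n_i: int, n_u: int) -> float:
    """Population size times the proportion infected in the sample."""
    return n * n_i / (n_i + n_u)


def naive_seed_adjusted(n: int, n_i: int, n_u: int, s_inf: int, s_non: int) -> float:
    """Apply the non-seed sample proportion to the non-seed population, then add infected seeds.

    n:            population size
    n_i, n_u:     infected and uninfected counts in the sample (seeds included)
    s_inf, s_non: numbers of infected and non-infected seeds
    """
    proportion = (n_i - s_inf) / (n_i - s_inf + n_u - s_non)
    return (n - s_inf - s_non) * proportion + s_inf


# Illustrative counts only (not results from the paper).
est_naive = naive_estimate(1000, 60, 140)
est_adjusted = naive_seed_adjusted(1000, 60, 140, 40, 0)
```

Neither function uses the network structure, so neither can account for the fact that contacts of infected nodes are over-represented in a contact-traced sample.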

Our approach is to fit an ERNM to the contact tracing data. In this situation the contact
tracing sampling design is clearly informative. For comparison, we compute two estimates
of the model. The first takes into account the informativeness of the contact tracing design
(MNAR) and the other assumes it is ignorable (MAR). These are based on the likelihoods
(2) and (5) respectively, and the algorithm in Section 4.

Figure 7 shows the results for each of the estimators over the samples. The median of
the MNAR estimator is centred around the true value of 150 in all sampling scenarios, while
the MAR estimator performs poorly with all infected seeds ($s_{-1} = 0$) and increasingly well
as the number of non-infected seeds increases to $s_{-1} = 225$. This is somewhat expected as
the proportion of infected in the seeds approximately matches that of the population when
$s_{-1} = 225$. The two naive estimators are significantly biased across all samples. This is
especially true for the sample mean, which is biased both by the seed selection and by the
link-tracing design. The adjusted sample mean corrects somewhat for the seed bias but
does not represent the link-tracing.

This application illustrates the advantage of the model-based approach over the ad hoc
estimators. By representing the structure of the networked population, the model-based
approach can leverage the information in the data more efficiently.

Fig. 7. Estimates via contact tracing with $s_1 = 40$ infected seeds and varying numbers of
non-infected seeds.



6. Discussion 



In this paper we have given a concise and systematic statistical framework for dealing with
partially observed network data when some knowledge is available on the sampling design.
The framework includes, but is not restricted to, ignorable sampling designs. We have also
shown that likelihood-based inference is practical under partial observation for ERN models,
and that the likelihood framework naturally accommodates standard sampling designs.

We developed and implemented algorithms to compute Monte Carlo approximations to
the likelihood, and showed how these can be used in practice. Three important special
cases of these designs were demonstrated in Section 5. In Sub-section 5.1 we consider a
missingness process which randomly selects dyads and nodal attributes to be missing.
Sub-section 5.2 considers the case where all nodal attributes are missing, thus introducing
a novel form of the latent cluster model.

In Sub-section 5.3 we consider non-ignorable sampling in the context of contact tracing
data, a case of vital importance to public health. To our knowledge, this is the first statistically
defensible approach to inference from this form of data. The example presented here shows
that the MLE estimation task is robust, in that it can be applied successfully to moderately
large networks (1000 nodes) with significant missingness (~70% of nodes unobserved), but
is limited by the fact that inference was performed on a simulated network. Whether
the model presented here would provide a good fit for real public health data remains an
important research question that we hope to address in the future.

Appendix: Algorithmic and Computational Details

A.1: Alternate MLE Formulation

While the algorithm outlined in Section 4 works well, there are some situations where an
alternate formulation using equation (5) may be useful. First let us consider the case where
$\theta = \theta_0$; then the log likelihood ratio is

$$
\ell(\eta) - \ell(\eta_0) =
\log E_{\eta_0}\big( e^{(\eta - \eta_0) \cdot g(T)} \mid T_{obs} = t_{obs} \big)
- \log E_{\eta_0}\big( e^{(\eta - \eta_0) \cdot g(T)} \big)
+ \log \frac{E_{\eta}\big( p(W = w \mid T, \theta) \mid T_{obs} = t_{obs} \big)}{E_{\eta_0}\big( p(W = w \mid T, \theta) \mid T_{obs} = t_{obs} \big)}. \tag{7}
$$

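The first expectation, and the expectation in the denominator of the third term, can be
calculated using an MCMC sample from $p(t \mid t_{obs}, \eta_0)$. The second can be approximated with
an MCMC sample from $p(t \mid \eta_0)$. The numerator of the third term can be approximated by
importance sampling,

$$
E_{\eta}\big( p(W = w \mid T, \theta) \mid T_{obs} = t_{obs} \big) \approx \sum_{i=1}^{k} \omega_i\, p(W = w \mid T = t^{(i)}, \theta),
$$

where $t^{(i)} \sim p(t \mid t_{obs}, \eta_0)$ and

$$
\omega_i = \frac{e^{(\eta - \eta_0) \cdot g(t^{(i)})}}{\sum_{j=1}^{k} e^{(\eta - \eta_0) \cdot g(t^{(j)})}}.
$$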

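If the sampling process is ignorable, then the third term drops out of the likelihood ratio.
The first and second derivatives of the likelihood are useful in the maximisation process.
For notational convenience let $A_i(t) = g_i(t) - E(g_i(T))$, and write $t = (t_{obs}, t_{miss})$. Then

$$
\begin{aligned}
\frac{\partial \ell}{\partial \eta_i}
&= \frac{\partial}{\partial \eta_i} \log\Big( \sum_{t_{miss}} p(W = w \mid T = t)\, P(T_{miss} = t_{miss} \mid \eta, T_{obs} = t_{obs})\, P(T_{obs} = t_{obs} \mid \eta) \Big) \\
&= \frac{\sum_{t_{miss}} p(W = w \mid T = t)\, A_i(t)\, P(T_{miss} = t_{miss} \mid \eta, T_{obs} = t_{obs})\, P(T_{obs} = t_{obs} \mid \eta)}
        {\sum_{t_{miss}} p(W = w \mid T = t)\, P(T_{miss} = t_{miss} \mid \eta, T_{obs} = t_{obs})\, P(T_{obs} = t_{obs} \mid \eta)} \\
&= \frac{E\big( p(W = w \mid T)\, A_i(T) \mid T_{obs} = t_{obs} \big)}{E\big( p(W = w \mid T) \mid T_{obs} = t_{obs} \big)},
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial^2 \ell}{\partial \eta_i\, \partial \eta_j}
&= \frac{\partial}{\partial \eta_j}
   \frac{\sum_{t_{miss}} p(W = w \mid T = t)\, A_i(t)\, P(T_{miss} = t_{miss} \mid \eta, T_{obs} = t_{obs})\, P(T_{obs} = t_{obs} \mid \eta)}
        {\sum_{t_{miss}} p(W = w \mid T = t)\, P(T_{miss} = t_{miss} \mid \eta, T_{obs} = t_{obs})\, P(T_{obs} = t_{obs} \mid \eta)} \\
&= \frac{E\big( p(W = w \mid T)\, A_i(T) A_j(T) \mid T_{obs} = t_{obs} \big)}{E\big( p(W = w \mid T) \mid T_{obs} = t_{obs} \big)}
 - \mathrm{cov}\big( g_i(T), g_j(T) \big) \\
&\quad - \frac{E\big( p(W = w \mid T)\, A_i(T) \mid T_{obs} = t_{obs} \big)\, E\big( p(W = w \mid T)\, A_j(T) \mid T_{obs} = t_{obs} \big)}
             {E\big( p(W = w \mid T) \mid T_{obs} = t_{obs} \big)^2}.
\end{aligned}
$$

And if the missingness process is ignorable, these equations simplify to

$$
\frac{\partial \ell}{\partial \eta_i} = E\big( A_i(T) \mid T_{obs} = t_{obs} \big),
\qquad
\frac{\partial^2 \ell}{\partial \eta_i\, \partial \eta_j} = \mathrm{cov}\big( g_i(T), g_j(T) \mid T_{obs} = t_{obs} \big) - \mathrm{cov}\big( g_i(T), g_j(T) \big).
$$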

For fixed $\eta$, the likelihood as a function of $\theta$,

$$
L(\theta \mid t_{obs}, w, \eta) \propto P(t_{obs} \mid \eta)\, E\big( P(W = w \mid T, \theta) \mid T_{obs} = t_{obs}, \eta \big)
\propto E\big( P(W = w \mid T, \theta) \mid T_{obs} = t_{obs}, \eta \big),
$$

can be maximised to find the MLE of $\theta$.

This motivates the following algorithm for maximising the observed data likelihood.

(a) Let $k = 0$ and choose initial parameter values $\eta^{(0)}, \theta^{(0)}$.

(b) Use MCMC to generate $M$ samples $t^{(i)}_{miss}$ from $P(t_{miss} \mid \eta^{(k)}, t_{obs})$.

(c) Use MCMC to generate $M$ samples $t^{(i)}$ from $P(t \mid \eta^{(k)})$.

(d) Set $\theta^{(k+1)} = \arg\max_{\theta} E\big( P(w \mid T, \theta) \mid T_{obs} = t_{obs}, \eta^{(k)} \big)$, with samples from step (b) used to
approximate the expectation.

(e) Using the samples from steps (b) and (c) to approximate the relevant expectations, find
$\eta^{(k+1)}$ maximising equation (7) subject to $\|\eta^{(k+1)} - \eta^{(k)}\| < \epsilon$.

(f) Set $k = k + 1$, and go to step (b).

The disadvantage of this method is that if the networks generated by the MNAR process
are very different from those generated assuming MAR, the estimates of the last expectation
in equation (7) can become unstable. The benefit of using this method is that the sampling
probability ($P(W = w \mid T = t, \theta)$) only needs to be calculated for networks included in the
sample, and not at every MCMC step as is required by the algorithm in Section 4. So
if the sampling probability is computationally expensive to calculate, this method can be
significantly faster than the one outlined in Section 4.

A.2: Estimating Network Statistics

We can use MCMC samples from $p(t_{miss} \mid t_{obs}, \eta)$ to estimate the network statistics of the
sampled network. Suppose that we have used MCMC to draw $k$ samples $t^{(i)}_{miss}$ from the
distribution $p(t_{miss} \mid t_{obs}, \eta)$, and let $t^{(i)} = (t_{obs}, t^{(i)}_{miss})$. Then we can estimate the expectation
of a set of network statistics $g$ as

$$
E\big( g(T) \mid t_{obs}, \eta \big) \approx \frac{1}{k} \sum_{i=1}^{k} g(t^{(i)}).
$$

However, this equation ignores the possible bias introduced by our sampling process $w$. The
distribution that we should be sampling from is the full conditional distribution of $t_{miss}$,

$$
p(T_{miss} = t_{miss} \mid T_{obs} = t_{obs}, W = w, \eta) \propto p(T_{miss} = t_{miss} \mid T_{obs} = t_{obs}, \eta)\, p(W = w \mid T = t, \theta).
$$

We then use importance sampling to estimate the relevant quantity

$$
E\big( g(T) \mid t_{obs}, w, \eta, \theta \big) \approx
\frac{\sum_{i=1}^{k} g(t^{(i)})\, P(W = w \mid T = t^{(i)}, \theta)}{\sum_{i=1}^{k} P(W = w \mid T = t^{(i)}, \theta)}.
$$
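
The importance-sampling correction of Appendix A.2 can be sketched as a self-normalised weighted average. This is an illustration under assumptions: the statistics of each completed sample and the log sampling probability of each sample are taken as already computed, and the function name `importance_estimate` is hypothetical.

```python
import math
from typing import List, Sequence


def importance_estimate(g_samples: Sequence[Sequence[float]],
                        log_w: Sequence[float]) -> List[float]:
    """Weighted mean of network statistics with weights proportional to p(w | t_i, theta).

    g_samples: statistics g(t_i) of samples drawn from p(t_miss | t_obs, eta)
    log_w:     log p(W = w | T = t_i, theta) for each sample
    """
    m = max(log_w)  # subtract the maximum so the exponentials cannot overflow
    weights = [math.exp(lw - m) for lw in log_w]
    total = sum(weights)
    q = len(g_samples[0])
    return [sum(wt * g[j] for wt, g in zip(weights, g_samples)) / total
            for j in range(q)]


# Equal sampling probabilities reduce to the plain sample mean.
plain = importance_estimate([[1.0], [3.0]], [0.0, 0.0])
# A sample that is three times as likely under the design gets three times the weight.
weighted = importance_estimate([[1.0], [3.0]], [math.log(1.0), math.log(3.0)])
```

As with the last expectation in equation (7), the estimate becomes unstable when a few samples carry almost all of the weight, so the spread of the weights is worth checking in practice.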